{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.core.display import display, HTML\n",
    "display(HTML(\"<style>.container { width:100% !important; }</style>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lecture 3A - Apply & Map, Misc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Lecture 3A - Apply & Map, Misc](#Lecture-3A---Apply-&-Map,-Misc)\n",
    "\t* &nbsp;\n",
    "\t\t* [Content](#Content)\n",
    "\t\t* [Learning Outcomes](#Learning-Outcomes)\n",
    "\t* [1. Functions and Dataframes - Using *apply()* and *applymap()*](#1.-Functions-and-Dataframes---Using-*apply%28%29*-and-*applymap%28%29*)\n",
    "\t\t* [Functions along an axis](#Functions-along-an-axis)\n",
    "\t\t* [Functions applied element-wise](#Functions-applied-element-wise)\n",
    "\t* [Dummy Variables](#Dummy-Variables)\n",
    "\t* [2. Removing Duplicates](#2.-Removing-Duplicates)\n",
    "\t* [3. Transpose](#3.-Transpose)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Content"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Applying functions to dataframes\n",
    "2. Removing duplicates\n",
    "3. Re-shaping dataframes with transpose\n",
    "4. Shift operations for time series"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Learning Outcomes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At the end of this lecture, you should be able to:\n",
    "\n",
    "* apply functions to dataframes\n",
    "* remove duplicate rows in dataframes\n",
    "* transpose dataframes\n",
    "* apply shift operations to dataframes for time series data\n",
    "\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "from pylab import rcParams\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set some Pandas options as you like\n",
    "pd.set_option('max_columns', 30)\n",
    "pd.set_option('max_rows', 30)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rcParams['figure.figsize'] = 15, 10\n",
    "rcParams['font.size'] = 20"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Functions and Dataframes - Using *apply()* and *applymap()* "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Built-in or user-defines functions can be applied along the entire axes of a dataframe.\n",
    "\n",
    "To apply a function to an entire axis (or multiple axes) of a dataframe, we resort to the apply() method, which can take an optional axis argument to determine if the axis is vertical/column-wise (0) or horizontal/row-wise (1)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Functions along an axis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),\n",
    "                'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),\n",
    "                'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})\n",
    "\n",
    "df = df[['one','two','three']]\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is an example of applying a built in sum function "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.apply(np.sum, axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise**: Apply the mean function to the above dataframe in a row-wise manner."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise**: Apply the sum function to columns 'one' and 'two' only in a row-wise manner, and assign the result to a new column in the dataframe called 'four'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise**: Replace the missing value in both columns with the row-wise mean value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise**: Calculate the column-wise product for the first and third columns only.     "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise**: Write a function which calculates the sum of a vector and then returns the square of the sum. Once you have done this, apply your function to the dataframe in a row-wise manner, whilst creating a new column 'five', to which you will add insert the result. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def square_of_summed_vector(x):\n",
    "\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Functions applied element-wise"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The apply() method produces some form of aggregate calculations on the axes of a dataframe.  applymap() on the other hand extends us the flexibility of applying functions which manipulate single elements in a dataframe.\n",
    "\n",
    "Say we would like to define a function which returns 'pos' for a positive number and alternatively 'neg'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def pos_neg_to_string(x):\n",
    "    if x >= 0:\n",
    "        return 'pos'\n",
    "    else: return 'neg'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can apply this to our dataframe as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.applymap(pos_neg_to_string)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Having the ability to apply element-wise operations on dataframes is extremely useful when it comes to dataset cleaning and transformations.\n",
    "\n",
    "Let's take a look at a sample from a real-world dataset used for gathering results from a survey:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "assig = pd.read_csv(\"../datasets/surveySample.csv\")\n",
    "assig.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "assig.OCCUPATION_M.head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clearly the values in this column need to be cleaned up.\n",
    "\n",
    "Let's first find out what all the unique values are in this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "assig.OCCUPATION_M.unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now write a function that removes the first 3 characters in each entry in order to tidy the values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def remove_first_three_chars(x):\n",
    "    return x.replace(x[:3], '')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "assig[['OCCUPATION_M']].applymap(remove_first_three_chars)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to make the change permanent, we need to assign the result to the dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "assig['OCCUPATION_M'] = assig[['OCCUPATION_M']].applymap(remove_first_three_chars)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Dummy Variables\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "A dummy variable is a numerical variable used in data analysis to represent subgroups of the sample in under study. \n",
    "\n",
    "In research design, a dummy variable is often used to distinguish different treatment groups. This is accomplished by taking distinct values from a column and creating new columns out of them which are populated with 0 or 1 in order to indicate whether or not the particular data point belongs to this. \n",
    "\n",
    "This is a frequent operation that can be easily in Python."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "assig['OCCUPATION_M'].str.get_dummies()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also specify if there are multiple values within some cells that should be treated as separate columns. In this example we will say that the forward slash indicates a distinct value for which we would like to generate a column for."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "assig['OCCUPATION_M'].str.get_dummies('/')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** From the assignment dataset, consider the column 'supermarket spend in a week'. The '\\$' character can cause issues in some applications. We want to clean up this column in such a way that the first 3 characters are replaced as well as the '\\$' character, and we also want to change entries with 'No Answer' to reflect that they are actually missing values so replace them with np.NaN. Write a function to do this and apply this function to this column.\n",
    "\n",
    "Verify that your code works. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Removing Duplicates"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Duplicate rows may be naturally occurring in some datasets or they might arise from input errors. In many instances, like machine learning, these duplicate entries need to be removed from the datasets. \n",
    "\n",
    "Dataframes provide straightforward functionality to remove such records.\n",
    "\n",
    "Here is an example:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({'c1': ['one'] * 3 + ['two'] * 4,\n",
    "                  'c2': [1, 1, 2, 3, 3, 4, 4]})\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`drop_duplicates` returns a DataFrame where the duplicated rows **across all columns** are dropped:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also pass a particular column we  would like the duplicates removed from. Let's first make a change to the dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc[1, 'c1'] = 'five'\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.drop_duplicates(['c2'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that `drop_duplicates` by default keep the first observed value combination."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Transpose"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Transposing is a special form of reshaping tabular data in such a way that the rows become columns and likewise the columns become rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),\n",
    "                'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),\n",
    "                'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})\n",
    "\n",
    "df = df[['one','two','three']]\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Transpose of a dataframe can be accomplished using either the transpose() method call  or simple .T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Transpose operations are not permanent unless you re-assign the result back tothe original dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** Slice and select out a dataframe with rows 'c' and 'd' and columns 'one' and 'two', then execute a transpose.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
